Raw data points alone provide little insight at the start of data analysis.
Data visualization is brilliant in two places:
* exploring the patterns of data quickly at the early stage
* enhancing the storytelling of the final conclusions of data analysis
As the picture above shows, in the Understand stage (exploring the data sets), Transform, Visualize, and Model are used in an iterative manner to gain the best early understanding of the data sets.
The advantage of the built-in plotting utilities is their simplicity: they let you quickly visualize data patterns while you are gaining a first insight at the early stage of your workflow.
plot(iris)
For built-in R data visualization, go to the R Intro project on GitHub to refresh your memory: [R Intro Source Codes](https://github.com/ngsanluk/R-Intro)
If the built-in plotting tools are not enough for you, go for ggplot2. It is the most popular data visualization package for R.
ggplot2 is an open-source data visualization package for R. ggplot2 breaks up graphs into semantic components such as scales and layers. Since 2005, ggplot2 has grown in use to become one of the most popular R packages.
if (!require("pacman")) install.packages("pacman") # check if pacman already installed. If not, install it.
Loading required package: pacman
pacman::p_load(
pacman, # package manager
datasets, # built-in data sets
rio, # r input / output
magrittr, # for piping commands
tidyverse, # the tidyverse collection (includes ggplot2, dplyr, etc.)
modelr # mathematics model
) # Load required packages. If they are NOT already downloaded, download them automatically.
ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same components below:
a data set + a coordinate system + geoms (visual marks that represent data points)
Grammar of Graphics
Let’s use the built-in cars data set for a simple ggplot
print(cars)
cars %>% ggplot()
It’s easy to add a geometry layer to the base coordinate system. Let’s ADD a layer of data points using the geom_point() function.
And yes, you can ADD a layer by using the + operator. geom_point() requires an x and a y value for each geom point; in a 2D coordinate system, a point is described by its x and y values.
We need to provide a mapping that specifies which data columns map to a point’s x and y values.
That mapping is defined by the aesthetics function: aes()
We usually name a plot after the geom that represents the data. In this case, it’s widely called a Scatter Plot. It is useful for exploring the relation between two variables.
cars %>% ggplot() + # NOTE: '+' operator must be placed at the end of a line
geom_point(mapping = aes(x=speed, y=dist))
geom_point() and geom_line() require very similar parameters. geom_line() is simply an enhanced visualization that automatically connects all the points. Line charts are used to explore/emphasize trends.
# just change geom_point to geom_line without changing anything else
cars %>% ggplot() +
geom_line(mapping = aes(x=speed, y=dist))
geom_smooth() and geom_point() require very similar parameters. geom_smooth() smooths out the line progression.
# just change geom_point to geom_smooth without changing anything else
cars %>% ggplot() +
geom_smooth(mapping = aes(x=speed, y=dist))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
A geom is the geometrical object that a plot uses to represent data. We used points, lines, and smoothing above on the same data set, and they deliver very different messages.
We usually name a plot after the geom that represents the data.
# keep geom_point and add a geom_smooth layer on top
cars %>% ggplot() +
geom_point(mapping = aes(x=speed, y=dist)) +
geom_smooth(mapping = aes(x=speed, y=dist))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
Un-comment the extra parameters to add more aesthetics to your plot
cars %>% ggplot() +
geom_point(mapping = aes(x=speed, y=dist),
color = "orange", # the color of data points
# size = 3, # the size of data point
# alpha = 0.5, # the transparency of data points, min=0, max=1
# shape = 0, # the shape of data point
)
allowance = read_csv("./data/allowance.csv")
Rows: 11 Columns: 13
── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Assessment_Year, Personal_Disability_Allowance
dbl (11): Basic, Married_Person, Child, Child_newborn, Dependent_Brother_Sister,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(allowance)
allowance %>%
ggplot() +
geom_point(
mapping=aes(x=Assessment_Year, y=Basic),
color = "orange",
size = 3
)
Continuous values refer to numeric values with a wide range, e.g. salary, height.
Discrete values refer to a limited number of valid values; they can be strings or a few distinct numbers.
When you produce plots, pay attention to what type of value is required by the geom object.
In many cases, you will need to convert the data first.
The mutate() function is often used for that.
Example:
allowance = allowance %>%
mutate(Assessment_Year = as.numeric(substr(Assessment_Year, 1 ,4)))
allowance %>% ggplot() +
geom_line(mapping = aes(x=Assessment_Year, y=Basic, group=1, color="Orange")) +
geom_point(mapping = aes(x=Assessment_Year, y=Basic, group=1, color="Orange"))
# As both geom use the same data mapping, the above statements can be simplified as
allowance %>% ggplot(aes(x=Assessment_Year, y=Basic, group=1, color="Orange")) +
geom_line() +
geom_point(size=5)
allowance %>% ggplot(aes(x=Assessment_Year, y=Basic, group=1, color="Orange")) +
geom_smooth() +
geom_point(size=5)
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
my.first.plot = allowance %>%
ggplot() +
geom_point(
mapping=aes(x=Assessment_Year, y=Basic),
color = "orange",
size = 3,
)
print(my.first.plot)
ggsave("./output/my_first_plot.png") # default image size
Saving 6.62 x 4.09 in image
ggsave("./output/my_first_plot_large.png", width=4000, height=2000, unit="px")
allowance %>%
ggplot() +
geom_col(mapping=aes(x=Assessment_Year, y=Basic),
fill="tomato")
geom_bar() is used for counting the frequency of each occurrence of an observed value. It’s usually for counting a limited set of values.
allowance %>%
ggplot() +
geom_bar(mapping=aes(x=Basic))
Exercise: add a line plot for the ‘Child’ column in the same plot; then add another line plot for ‘Dependent_Parent_60’.
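One possible sketch of the exercise above. It is hedged: a made-up stand-in tibble replaces the real allowance.csv so the snippet runs on its own (the real values differ); mapping color to a constant label inside aes() gives each line its own legend entry.

```r
library(tidyverse)

# made-up stand-in for allowance.csv (values are NOT the real ones)
allowance_demo <- tibble(
  Assessment_Year = 2015:2019,
  Basic = c(120000, 132000, 132000, 132000, 132000),
  Child = c(100000, 100000, 100000, 120000, 120000),
  Dependent_Parent_60 = c(40000, 46000, 46000, 50000, 50000)
)

# one geom_line() layer per column
p <- allowance_demo %>%
  ggplot(aes(x = Assessment_Year)) +
  geom_line(aes(y = Basic, color = "Basic")) +
  geom_line(aes(y = Child, color = "Child")) +
  geom_line(aes(y = Dependent_Parent_60, color = "Dependent_Parent_60")) +
  ylab("Allowance")
print(p)
```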
graduates = read_csv("./data/graduates.csv")
Rows: 653 Columns: 5
── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (4): AcademicYear, LevelOfStudy, ProgrammeCategory, Sex
dbl (1): Headcount
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(graduates)
Let’s explore the data with some simple ggplot plots. Just some quick explorations; they might not be very useful.
Use a line plot to explore the trend of the “Business and Management” student headcount.
Use line plots to compare the trends of female undergraduate student headcounts in the “ProgrammeCategory” values “Business and Management” and “Engineering and Technology”.
Use filter() to extract the required records. You can use multiple filter() calls, or use &, |, or multiple conditions.
graduates %>%
.$ProgrammeCategory %>% # '.' refers to the data piped from the previous command
unique() # display the unique names of ProgrammeCategory
[1] "Arts and Humanities" "Business and Management"
[3] "Education" "Engineering and Technology"
[5] "Medicine, Dentistry and Health" "Sciences"
[7] "Social Sciences"
graduates %>%
filter(LevelOfStudy=="Undergraduate", Sex=="F") %>%
filter(ProgrammeCategory=="Business and Management" | ProgrammeCategory=="Engineering and Technology") %>%
print() # Test extracting and printing the required records.
graduates %>%
filter(LevelOfStudy=="Undergraduate", Sex=="F") %>%
filter(ProgrammeCategory=="Business and Management" | ProgrammeCategory=="Engineering and Technology") %>%
ggplot(
aes(x=AcademicYear,
y=Headcount,
group=ProgrammeCategory,
color=ProgrammeCategory
)
) +
geom_line() +
geom_point()
library(jsonlite) # load package
Attaching package: ‘jsonlite’
The following object is masked from ‘package:purrr’:
flatten
hkma.interbank.url = "https://api.hkma.gov.hk/public/market-data-and-statistics/daily-monetary-statistics/daily-figures-interbank-liquidity"
interbank.liquidity = fromJSON(hkma.interbank.url)
# the above retrieval will take a while. The server response is slow.
summary(interbank.liquidity)
Length Class Mode
header 3 -none- list
result 2 -none- list
str(interbank.liquidity)
List of 2
$ header:List of 3
..$ success : logi TRUE
..$ err_code: chr "0000"
..$ err_msg : chr "No error found"
$ result:List of 2
..$ datasize: int 100
..$ records :'data.frame': 100 obs. of 44 variables:
.. ..$ end_of_date : chr [1:100] "2022-04-11" "2022-04-08" "2022-04-07" "2022-04-06" ...
.. ..$ cu_weakside : num [1:100] 7.85 7.85 7.85 7.85 7.85 7.85 7.85 7.85 7.85 7.85 ...
.. ..$ cu_strongside : num [1:100] 7.75 7.75 7.75 7.75 7.75 7.75 7.75 7.75 7.75 7.75 ...
.. ..$ disc_win_base_rate : num [1:100] 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 ...
.. ..$ hibor_overnight : num [1:100] 0.03 0.02 0.02 0.02 0.01 0.02 0.03 0.03 0.02 0.02 ...
.. ..$ hibor_fixing_1m : num [1:100] 0.188 0.185 0.188 0.196 0.206 ...
.. ..$ twi : num [1:100] 96 95.8 95.6 95.7 95.6 95.5 95.2 95.3 95.7 95.9 ...
.. ..$ opening_balance : int [1:100] 337554 337554 337554 337551 337551 337551 337551 337534 337534 337534 ...
.. ..$ closing_balance : int [1:100] 337554 337554 337554 337554 337551 337551 337551 337551 337534 337534 ...
.. ..$ market_activities : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ interest_payment : chr [1:100] "+0" "+0" "+0" "+3" ...
.. ..$ discount_window_reversal : chr [1:100] "-0" "-0" "-0" "-0" ...
.. ..$ discount_window_activities : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ intraday_movements_of_aggregate_balance_at_0930: int [1:100] 371125 375503 377226 375387 381938 371916 363729 363880 350696 357104 ...
.. ..$ intraday_movements_of_aggregate_balance_at_1000: int [1:100] 375339 379266 385469 381909 386912 373675 367985 366723 358279 358428 ...
.. ..$ intraday_movements_of_aggregate_balance_at_1100: int [1:100] 398461 398028 395229 406391 391314 395677 388202 383598 376635 375870 ...
.. ..$ intraday_movements_of_aggregate_balance_at_1200: int [1:100] 404220 404131 403552 353051 404440 399350 402104 331056 383952 385132 ...
.. ..$ intraday_movements_of_aggregate_balance_at_1500: int [1:100] 402494 400980 401072 348658 405804 398916 400683 334817 391180 389282 ...
.. ..$ intraday_movements_of_aggregate_balance_at_1600: int [1:100] 403079 400951 402067 353770 408927 399310 404832 337197 392468 391754 ...
.. ..$ forex_trans_t1 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ other_market_activities_t1 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ reversal_of_discount_window_t1 : chr [1:100] "-0" "-0" "-0" "-0" ...
.. ..$ interest_payment_issuance_efbn_t1 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ forecast_aggregate_bal_t1 : int [1:100] 337554 337554 337554 337554 337554 337551 337551 337551 337551 337534 ...
.. ..$ forex_trans_t2 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ other_market_activities_t2 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ reversal_of_discount_window_t2 : chr [1:100] "-0" "-0" "-0" "-0" ...
.. ..$ interest_payment_issuance_efbn_t2 : chr [1:100] "+32" "+0" "+0" "+0" ...
.. ..$ forecast_aggregate_bal_t2 : int [1:100] 337586 337554 337554 337554 337554 337534 337551 337551 337551 337541 ...
.. ..$ forex_trans_t3 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ other_market_activities_t3 : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ reversal_of_discount_window_t3 : chr [1:100] "-0" "-0" "-0" "-0" ...
.. ..$ interest_payment_issuance_efbn_t3 : chr [1:100] "+0" "+29" "+0" "+0" ...
.. ..$ forecast_aggregate_bal_t3 : int [1:100] 337586 337583 337554 337554 337554 337534 337534 337551 337551 337541 ...
.. ..$ forex_trans_t4 : chr [1:100] NA NA NA NA ...
.. ..$ other_market_activities_t4 : chr [1:100] NA NA NA NA ...
.. ..$ reversal_of_discount_window_t4 : chr [1:100] NA NA NA NA ...
.. ..$ interest_payment_issuance_efbn_t4 : chr [1:100] NA NA NA NA ...
.. ..$ forecast_aggregate_bal_t4 : int [1:100] NA NA NA NA NA NA NA NA NA NA ...
.. ..$ forex_trans_u : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ other_market_activities_u : chr [1:100] "+0" "+0" "+0" "+0" ...
.. ..$ reversal_of_discount_window_u : chr [1:100] "-0" "-0" "-0" "-0" ...
.. ..$ interest_payment_issuance_efbn_u : chr [1:100] "-107" "-104" "-75" "-75" ...
.. ..$ forecast_aggregate_bal_u : int [1:100] 337479 337479 337479 337479 337479 337479 337479 337479 337479 337479 ...
interbank.liquidity$result
$datasize
[1] 100
$records
str(interbank.liquidity$result)
List of 2
$ datasize: int 100
$ records :'data.frame': 100 obs. of 44 variables:
..$ end_of_date : chr [1:100] "2022-04-11" "2022-04-08" "2022-04-07" "2022-04-06" ...
..$ cu_weakside : num [1:100] 7.85 7.85 7.85 7.85 7.85 7.85 7.85 7.85 7.85 7.85 ...
..$ cu_strongside : num [1:100] 7.75 7.75 7.75 7.75 7.75 7.75 7.75 7.75 7.75 7.75 ...
..$ disc_win_base_rate : num [1:100] 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 ...
..$ hibor_overnight : num [1:100] 0.03 0.02 0.02 0.02 0.01 0.02 0.03 0.03 0.02 0.02 ...
..$ hibor_fixing_1m : num [1:100] 0.188 0.185 0.188 0.196 0.206 ...
..$ twi : num [1:100] 96 95.8 95.6 95.7 95.6 95.5 95.2 95.3 95.7 95.9 ...
..$ opening_balance : int [1:100] 337554 337554 337554 337551 337551 337551 337551 337534 337534 337534 ...
..$ closing_balance : int [1:100] 337554 337554 337554 337554 337551 337551 337551 337551 337534 337534 ...
..$ market_activities : chr [1:100] "+0" "+0" "+0" "+0" ...
..$ interest_payment : chr [1:100] "+0" "+0" "+0" "+3" ...
..$ discount_window_reversal : chr [1:100] "-0" "-0" "-0" "-0" ...
..$ discount_window_activities : chr [1:100] "+0" "+0" "+0" "+0" ...
..$ intraday_movements_of_aggregate_balance_at_0930: int [1:100] 371125 375503 377226 375387 381938 371916 363729 363880 350696 357104 ...
..$ intraday_movements_of_aggregate_balance_at_1000: int [1:100] 375339 379266 385469 381909 386912 373675 367985 366723 358279 358428 ...
..$ intraday_movements_of_aggregate_balance_at_1100: int [1:100] 398461 398028 395229 406391 391314 395677 388202 383598 376635 375870 ...
..$ intraday_movements_of_aggregate_balance_at_1200: int [1:100] 404220 404131 403552 353051 404440 399350 402104 331056 383952 385132 ...
..$ intraday_movements_of_aggregate_balance_at_1500: int [1:100] 402494 400980 401072 348658 405804 398916 400683 334817 391180 389282 ...
..$ intraday_movements_of_aggregate_balance_at_1600: int [1:100] 403079 400951 402067 353770 408927 399310 404832 337197 392468 391754 ...
..$ forex_trans_t1 : chr [1:100] "+0" "+0" "+0" "+0" ...
..$ other_market_activities_t1 : chr [1:100] "+0" "+0" "+0" "+0" ...
..$ reversal_of_discount_window_t1 : chr [1:100] "-0" "-0" "-0" "-0" ...
..$ interest_payment_issuance_efbn_t1 : chr [1:100] "+0" "+0" "+0" "+0" ...
..$ forecast_aggregate_bal_t1 : int [1:100] 337554 337554 337554 337554 337554 337551 337551 337551 337551 337534 ...
..$ forex_trans_t2 : chr [1:100] "+0" "+0" "+0" "+0" ...
..$ other_market_activities_t2 : chr [1:100] "+0" "+0" "+0" "+0" ...
..$ reversal_of_discount_window_t2 : chr [1:100] "-0" "-0" "-0" "-0" ...
..$ interest_payment_issuance_efbn_t2 : chr [1:100] "+32" "+0" "+0" "+0" ...
..$ forecast_aggregate_bal_t2 : int [1:100] 337586 337554 337554 337554 337554 337534 337551 337551 337551 337541 ...
..$ forex_trans_t3 : chr [1:100] "+0" "+0" "+0" "+0" ...
..$ other_market_activities_t3 : chr [1:100] "+0" "+0" "+0" "+0" ...
..$ reversal_of_discount_window_t3 : chr [1:100] "-0" "-0" "-0" "-0" ...
..$ interest_payment_issuance_efbn_t3 : chr [1:100] "+0" "+29" "+0" "+0" ...
..$ forecast_aggregate_bal_t3 : int [1:100] 337586 337583 337554 337554 337554 337534 337534 337551 337551 337541 ...
..$ forex_trans_t4 : chr [1:100] NA NA NA NA ...
..$ other_market_activities_t4 : chr [1:100] NA NA NA NA ...
..$ reversal_of_discount_window_t4 : chr [1:100] NA NA NA NA ...
..$ interest_payment_issuance_efbn_t4 : chr [1:100] NA NA NA NA ...
..$ forecast_aggregate_bal_t4 : int [1:100] NA NA NA NA NA NA NA NA NA NA ...
..$ forex_trans_u : chr [1:100] "+0" "+0" "+0" "+0" ...
..$ other_market_activities_u : chr [1:100] "+0" "+0" "+0" "+0" ...
..$ reversal_of_discount_window_u : chr [1:100] "-0" "-0" "-0" "-0" ...
..$ interest_payment_issuance_efbn_u : chr [1:100] "-107" "-104" "-75" "-75" ...
..$ forecast_aggregate_bal_u : int [1:100] 337479 337479 337479 337479 337479 337479 337479 337479 337479 337479 ...
interbank.records = interbank.liquidity$result$records %>% as_tibble()
interbank.records
interbank.records %>%
ggplot() +
geom_line(
mapping=aes(x=end_of_date, y=hibor_fixing_1m, group=1),
color="orange"
)
graduates %>% group_by(AcademicYear, LevelOfStudy) %>%
summarise(TotalHeadcount = sum(Headcount))
`summarise()` has grouped output by 'AcademicYear'. You can override using the
`.groups` argument.
graduates %>% group_by(AcademicYear, LevelOfStudy) %>%
summarise(TotalHeadcount = sum(Headcount)) %>%
ggplot(
aes(x=AcademicYear,
y=TotalHeadcount,
group=LevelOfStudy,
color=LevelOfStudy
)
) +
geom_line() +
geom_point()
`summarise()` has grouped output by 'AcademicYear'. You can override using the
`.groups` argument.
Use filter() to keep only “Taught Postgraduate” records.
This plot is not very useful without first applying filter(), group_by(), and summarise().
graduates %>%
filter(LevelOfStudy=="Taught Postgraduate") %>%
ggplot() +
geom_line(mapping=aes(x=AcademicYear,y=Headcount, group=ProgrammeCategory, color=ProgrammeCategory))
Use filter() to extract the required rows. Use group_by() and summarise() to group and aggregate the total headcount for both male and female.
graduates %>%
filter(LevelOfStudy=="Taught Postgraduate") %>%
group_by(AcademicYear, ProgrammeCategory) %>%
summarise(TotalHeadcount = sum(Headcount)) %>%
ggplot() +
geom_line(mapping=aes(x=AcademicYear,y=TotalHeadcount, group=ProgrammeCategory, color=ProgrammeCategory))
`summarise()` has grouped output by 'AcademicYear'. You can override using the
`.groups` argument.
# Following is the same chart for "undergraduate"
# graduates %>%
# filter(LevelOfStudy=="Undergraduate") %>%
# group_by(AcademicYear, ProgrammeCategory) %>%
# summarise(TotalHeadcount = sum(Headcount)) %>%
# ggplot() +
# geom_line(mapping=aes(x=AcademicYear,y=TotalHeadcount, group=ProgrammeCategory, color=ProgrammeCategory))
Useful aggregation functions:
* Center: mean(), median()
* Spread: sd(), IQR(), mad()
* Range: min(), max(), quantile()
* Position: first(), last(), nth()
* Count: n(), n_distinct()
* Logical: any(), all()
More information: the summarise() function documentation
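A minimal sketch combining a few of these inside summarise(). It uses a small made-up stand-in for the graduates data so it runs on its own:

```r
library(tidyverse)

# made-up stand-in for graduates.csv (values are invented)
grads_demo <- tibble(
  LevelOfStudy = c("Undergraduate", "Undergraduate",
                   "Taught Postgraduate", "Taught Postgraduate"),
  Headcount = c(100, 300, 50, 150)
)

level_summary <- grads_demo %>%
  group_by(LevelOfStudy) %>%
  summarise(
    Total   = sum(Headcount),  # total per group
    Center  = mean(Headcount), # center
    Largest = max(Headcount),  # range
    Rows    = n()              # count of records
  )
print(level_summary)
```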
A bar chart gives the counting frequency (the number of records in the data set).
graduates %>%
ggplot() +
geom_bar(mapping=aes(x=AcademicYear)) # you only need to provide the x axis
The boxplot compactly displays the distribution of a continuous variable. It visualises five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually.
graduates %>%
ggplot() +
geom_point(mapping=aes(x=Sex, y=Headcount))
graduates %>%
ggplot() +
geom_boxplot(mapping=aes(x=LevelOfStudy, y=Headcount))
Facets are useful for categorical variables. They split your plot into sub-plots (a.k.a. facets), each displaying one subset of the data.
facet_wrap() lets you work with ONE extra variable (besides x and y).
Each categorical value produces a sub-plot. We use ‘LevelOfStudy’ here; since there are FOUR distinct values, you will see FOUR sub-plots.
graduates %>% ggplot() +
geom_point(mapping = aes(x=AcademicYear, y=Headcount)) +
facet_wrap(~ LevelOfStudy) # You will see FOUR sub plots as there are FOUR distinct values for this variable.
facet_grid() lets you work with TWO extra variables (besides x and y).
Each combination of the two variables’ values produces a sub-plot.
You should SEE 28 sub-plots, as there are FOUR distinct values for ‘LevelOfStudy’ and SEVEN for ‘ProgrammeCategory’.
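The facet_grid() call itself can be sketched as below. A small made-up tibble replaces the graduates data so the snippet stands alone; with the real data, `facet_grid(LevelOfStudy ~ ProgrammeCategory)` yields the 4 x 7 = 28 sub-plots.

```r
library(tidyverse)

# made-up stand-in: 2 levels of study x 3 programme categories
toy <- expand_grid(
  AcademicYear = c("2018/19", "2019/20"),
  LevelOfStudy = c("Undergraduate", "Taught Postgraduate"),
  ProgrammeCategory = c("Sciences", "Education", "Social Sciences")
) %>%
  mutate(Headcount = seq_len(n()) * 100) # invented headcounts

p <- toy %>% ggplot() +
  geom_point(mapping = aes(x = AcademicYear, y = Headcount)) +
  facet_grid(LevelOfStudy ~ ProgrammeCategory) # rows ~ columns: 2 x 3 = 6 sub-plots here
print(p)
```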
Use of title, label, background color and themes
# in this example we save the plot to a variable named 'level.bar.plot' so that we can reuse it again and again.
level.bar.plot = graduates %>%
filter(ProgrammeCategory=="Engineering and Technology") %>%
ggplot() +
geom_col(mapping=aes(x=AcademicYear, y=Headcount, fill=LevelOfStudy))
# To show the plot, just call the print() function with the previously saved plot variable as its parameter.
print(level.bar.plot)
element_rect() is a function that generates a rectangle element. You specify its fill parameter with a color name or a hex code, given as a string.
The plot background refers to the large area covering everything relevant to the plot.
level.bar.plot # default style
level.bar.plot +
theme(plot.background = element_rect(fill="orange")) # styling the plot background
The panel background refers to the inner area of the plot; the areas showing the title, axis labels, and legend are NOT included.
level.bar.plot # default style
level.bar.plot +
theme(panel.background = element_rect(fill="orange")) # styling the panel background
In visual design, color is a very powerful tool to guide users’ attention, but you have to use it carefully.
Too many colors usually do the opposite: they confuse the audience. Minimal design is the recent trend, especially as many people use small devices like mobile phones or tablets for day-to-day communication.
In this example, we remove both the plot and panel backgrounds to achieve a clean design. After all, the background is NOT the main dish; very often a background color distracts from the graph.
level.bar.plot # default style
level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) # styling the grid line for y-axis
level.bar.plot # default style
level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Student") + # Label for Y axis
xlab("Year") # Label for X axis
level.bar.plot # default style
level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Student") + # Label for Y axis
xlab("Year") + # Label for X axis
theme(axis.text.x = element_text(angle = 45))
level.bar.plot # default style
level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Student") + # Label for Y axis
xlab("Year") + # Label for X axis
scale_fill_manual(values=c("purple", "orange", "blue", "tomato")) # use c() function to specify color list
level.bar.plot # default style
level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Student") + # Label for Y axis
xlab("Year") + # Label for X axis
theme(legend.position="top") +
scale_fill_manual(values=c("purple", "orange", "blue", "tomato"),
guide = guide_legend(title="Level of Study",
label.position = "bottom")
) # move legend position to top and label position to bottom
level.bar.plot # default style
level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Student") + # Label for Y axis
xlab("Year") + # Label for X axis
theme(legend.position="top") +
scale_fill_manual(values=c("purple", "orange", "blue", "tomato"),
guide = guide_legend(title="Level of Study",
label.position = "bottom")
) + # move legend position to top and label position to bottom
ggtitle("Hong Kong Higher Education Student Headcount", subtitle="2009 - 2019")
Add extra text/shapes to enhance your visualization.
level.bar.plot # default style
level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Student") + # Label for Y axis
xlab("Year") + # Label for X axis
theme(legend.position="top") +
scale_fill_manual(values=c("purple", "orange", "blue", "tomato"),
guide = guide_legend(title="Level of Study",
label.position = "bottom")
) + # move legend position to top and label position to bottom
ggtitle("Hong Kong Higher Education Student Headcount", subtitle="2009 - 2020") +
annotate("text", label="Record\nHigh", x="2017/18", y=5300) # you can change text position value of x and y to set the text position
level.bar.plot # default style
level.bar.plot +
theme(panel.background = element_blank()) + # styling the panel background to none
theme(plot.background = element_blank()) + # styling the plot background to none
theme(panel.grid.major.y = element_line(color="grey")) + # styling the grid line for y-axis
ylab("Number of Student") + # Label for Y axis
xlab("Year") + # Label for X axis
theme(legend.position="top") +
scale_fill_manual(values=c("purple", "orange", "blue", "tomato"),
guide = guide_legend(title="Level of Study",
label.position = "bottom")
) + # move legend position to top and label position to bottom
ggtitle("Hong Kong Higher Education Student Headcount", subtitle="2009 - 2019") +
annotate("text", label="Record\nHigh", x="2017/18", y=5300) + # you can change text position value of x and y to set the text position
geom_hline(yintercept=3200) + # adds horizontal line
geom_vline(xintercept = "2017/18", color="white") # adds vertical line
Install the ggthemes package to unlock a wider selection of themes.
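A sketch of how a ggthemes theme is applied. It assumes ggthemes can be installed; for a self-contained demo it builds a quick bar chart from the mpg data bundled with ggplot2 instead of reusing level.bar.plot.

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, ggthemes) # ggthemes supplies extra ready-made themes

# a quick demo plot (stand-in for level.bar.plot above)
p <- ggplot(mpg, aes(class)) + geom_bar()
print(p + theme_economist())       # The Economist magazine look
print(p + theme_fivethirtyeight()) # FiveThirtyEight look
```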
browseURL("https://ggplot2.tidyverse.org/")
browseURL("https://exts.ggplot2.tidyverse.org/gallery")
Workflow of Data Science
Data Science combines the efforts and results of programming, mathematics, and domain expertise. Among these, mathematics is the foundation of models. With models, data scientists make predictions, discover hidden patterns, and conclude insights.
Modeling is usually an iterative process among data transformation, data visualization, model exploration, and fitting.
Humans are good at drawing conclusions and providing insight, but NOT at directly facing a large number of data attributes and a huge volume of raw data.
Model Examples
A model is a mathematical expression that provides a simple, low-dimensional summary of a data set so that we can draw conclusions and even provide insights.
Models only provide approximations (NOT the exact truth).
Let’s do some simple R coding to uncover the basic concept of a model.
if (!require("pacman")) install.packages("pacman") # install pacman
pacman::p_load(pacman, tidyverse, modelr, magrittr) # install (or load) required packages
Let’s use a simple built-in data set, sim1, for exploration. In this simulated data, you can clearly see the pattern with the help of a simple scatter plot.
p_data(modelr) # display all the built-in data sets of modelr
Data Description
1 heights Height and income data.
2 sim1 Simple simulated datasets
3 sim2 Simple simulated datasets
4 sim3 Simple simulated datasets
5 sim4 Simple simulated datasets
print(heights)
?heights
print(sim1) # contains two continuous variables: x, y
ggplot(sim1, aes(x, y)) +
geom_point()
Straight lines and quadratic curves are widely used to explore the relation between two variables. Let’s take the linear model as a simple example to grasp the essence of a model.
A linear model is described as y = a1 + a2 * x, where x and y are the variables from the data set, and a1 and a2 are parameters that can vary to capture different patterns.
Let’s generate a random value of a1 as the intercept and a2 as the slope. Here, we use runif() to generate uniformly distributed random numbers.
Note: you might have to re-run this a few times before the visualized random model (the orange straight line) lands anywhere near the data.
model = tibble(
a1 = runif(1, -20, 40), # random intercept value between -20 and 40
a2 = runif(1, -5, 5) # random slope value between -5 and 5
)
# print(model) # un-comment to print the random model
ggplot(sim1, aes(x,y)) +
geom_point() + # this plots all the data
geom_abline(aes(intercept = a1, slope = a2),
data=model,
color="Orange"
) # this adds the straight line (our random model) on top of the data layer
The number of potential models is unlimited. Let’s generate 250 random ones as candidate models.
Among these 250 models, some are so bad that you can tell at a glance. Others are not bad, but we don’t know which one is the best among them.
models = tibble(
a1 = runif(250, -20, 40), # 250 random intercept values between -20 and 40
a2 = runif(250, -5, 5) # 250 random slope values between -5 and 5
)
# let's add these random linear models as an overlay on top of the data
ggplot(sim1, aes(x,y)) +
geom_point() +
geom_abline(aes(intercept = a1, slope = a2), data=models, alpha=0.2)
# this function calculates the modeled y value of each given 'x' value
modeled_y = function(a, data) {
a[1] + data$x * a[2] # a[1] is the intercept and a[2] is the slope
}
# modeled_y(c(7, 1.5), sim1) # test-run modeled_y function
# this function calculates the distance between an actual y value and the predicted y value (modeled_y)
measure_distance = function(mod, data) {
diff <- data$y - modeled_y(mod, data) # mod is random intercept and slope of a certain model
sqrt(mean(diff ^ 2)) # root-mean-squared deviation to compute overall distance
}
# measure_distance(c(7, 1.5), sim1) # test-run measure_distance function
# this function calculates the 'overall' distance for a given model with a1 as intercept and a2 as slope
sim1_dist = function(a1, a2) {
measure_distance(c(a1, a2), sim1) # a1 is the intercept of a model while a2 is the slope
}
# use map2_dbl (a mapping function) to add a new column named 'dist' to each random model
models %<>%
mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models # now we have an extra column named 'dist' in our models
# plotting the best 10 models from the ranked distances
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = a1, slope = a2, color = -dist) ,
data = filter(models, rank(dist) <= 10) # To show only the best 5, change 10 to 5
)
If the previous R code for choosing the best 10 among 250 random models is too much to digest, that’s fine. It’s just meant to give you a feel for the process and essence of models and model fitting.
In fact, R makes linear model fitting extremely easy: a single call to the lm() function (a built-in linear model fitting function) does it.
lm() finds the closest model in a single step, using a sophisticated algorithm that involves geometry, calculus, and linear algebra.
lm() has a special way to specify the model family: formulas. Formulas look like y ~ x, which lm() will translate to a function like y = a1 + a2 * x
# regular function calling manner
sim1_auto_model = lm(y ~ x, data = sim1) # finding the optimized linear model
# Or using piping below
sim1_auto_model = sim1 %>% lm(y ~ x, .) # piping normally feeds the previous output in as the FIRST parameter of the next call; the "." placeholder feeds sim1 in as the SECOND parameter (data) instead
print(sim1_auto_model) # print out the auto generated linear model
Call:
lm(formula = y ~ x, data = .)
Coefficients:
(Intercept) x
4.221 2.052
print(summary(sim1_auto_model)) # print out the summary of the generated linear model
Call:
lm(formula = y ~ x, data = .)
Residuals:
Min 1Q Median 3Q Max
-4.1469 -1.5197 0.1331 1.4670 4.6516
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.2208 0.8688 4.858 4.09e-05 ***
x 2.0515 0.1400 14.651 1.17e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.203 on 28 degrees of freedom
Multiple R-squared: 0.8846, Adjusted R-squared: 0.8805
F-statistic: 214.7 on 1 and 28 DF, p-value: 1.173e-14
sim1_coef = coef(sim1_auto_model) # retrieves model's intercept and slope
print(sim1_coef)
(Intercept) x
4.220822 2.051533
# visualize the auto generated linear model on top of the sim1 data
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = sim1_coef[1], slope = sim1_coef[2])
)
Making Prediction
new.data = data.frame(x = c(1,2,3,4,5,6,7,8,9,10))
predict(sim1_auto_model, new.data)
1 2 3 4 5 6 7 8
6.272355 8.323888 10.375421 12.426954 14.478487 16.530020 18.581553 20.633087
9 10
22.684620 24.736153
Using the lm() function on a categorical variable will use the mean value of each category for prediction.
p_data(modelr) # shows included data sets with modelr package
Data Description
1 heights Height and income data.
2 sim1 Simple simulated datasets
3 sim2 Simple simulated datasets
4 sim3 Simple simulated datasets
5 sim4 Simple simulated datasets
sim2 %>% summary()
x y
Length:40 Min. :-0.9101
Class :character 1st Qu.: 1.3121
Mode :character Median : 3.4199
Mean : 4.3266
3rd Qu.: 6.9868
Max. :10.7554
sim2
model_cat = sim2 %>% lm(y~x, .)
print(model_cat)
Call:
lm(formula = y ~ x, data = .)
Coefficients:
(Intercept) xb xc xd
1.1522 6.9639 4.9750 0.7588
grid = sim2 %>%
data_grid(x) %>%
add_predictions(model_cat)
grid
ggplot(sim2, aes(x)) +
geom_point(aes(y = y)) +
geom_point(data = grid, aes(y = pred), colour = "red", size = 4)
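As a quick sanity check of the claim above, the sim2 predictions can be compared against each category’s mean of y. This sketch re-fits the same model as model_cat so it runs on its own:

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, modelr)

model_cat <- lm(y ~ x, data = sim2) # same model as above

# mean of y per category
group_means <- sim2 %>%
  group_by(x) %>%
  summarise(mean_y = mean(y))

# predicted y per category
preds <- sim2 %>%
  data_grid(x) %>%
  add_predictions(model_cat)

# for a single categorical predictor, the predictions ARE the group means
checked <- group_means %>% mutate(pred = preds$pred)
print(checked)
```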
sim3
Multiple regression is an extension of linear regression to relationships among more than two variables.
?mtcars
print(mtcars)
m.r.model = mtcars %>% lm(mpg~disp+hp+wt, .)
print(m.r.model)
Call:
lm(formula = mpg ~ disp + hp + wt, data = .)
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
new.data.2 = data.frame(
disp = c(221),
hp = c(102),
wt = c(2.91)
) %>% print()
predict(m.r.model, new.data.2)
1
22.65987
# Note: bi.model and new.data.3 are created in an exercise not shown in this section
predict(bi.model, new.data.3)
1
-10.79681
Visualization is usually thought of as a tool for hypothesis generation, exploring the hidden patterns in data, while modeling is usually thought of as a tool for hypothesis confirmation, confirming what the visualization tools have found. These two tools are best used in an iterative manner so as to reveal the data more deeply.
browseURL("https://www.tutorialspoint.com/ggplot2/ggplot2_introduction.htm")
ggplot2: Elegant Graphics for Data Analysis (web-based e-book)
browseURL("https://ggplot2-book.org/")
An excellent e-book for R Data Science, covering:
* basic concepts of models
* model building
* model examples
A web-based e-book at
browseURL("https://r4ds.had.co.nz/model-intro.html")
browseURL("https://www.tutorialspoint.com/r/r_linear_regression.htm")
Quick syntax references for R programming and third-party packages
browseURL("https://www.rstudio.com/resources/cheatsheets/")
Advanced R
browseURL("https://adv-r.hadley.nz/index.html")